Use Thrift macro to generate Parquet LogicalType serialization code#9997
Use Thrift macro to generate Parquet LogicalType serialization code#9997etseidl wants to merge 17 commits into
LogicalType serialization code#9997Conversation
LogicalType serialization code
|
Depends on #9996, so keeping draft for now |
|
run benchmark metadata |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing logical_type_macro (39a4bc1) to fd1c5b3 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark parquet_round_trip |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing logical_type_macro (50c5168) to fd1c5b3 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark metadata |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing logical_type_macro (b9ff17a) to 73c513a (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
| // An earlier version of this crate enforced this when decoding LogicalType. Now that | ||
| // the decoder is macro generated, we do this to preserve the original behavior. | ||
| // TODO: this was done due to a line in the spec saying an unset algorithm defaults | ||
| // to SPHERICAL. But there is no default in the Thrift, so it would be better to set the | ||
| // default when consumed. | ||
| // See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#geography | ||
| for se in schema.iter_mut() { | ||
| match se.logical_type.as_mut() { | ||
| Some(LogicalType::Geography(g)) if g.algorithm.is_none() => { | ||
| g.algorithm = Some(Default::default()); | ||
| } |
There was a problem hiding this comment.
@paleolimbot I'm wondering if this is still necessary, or if this can be enforced where the logical type is consumed. For instance, the one place I could find where removing this logic would have an impact is when handling the extension type.
LogicalType::Geography(geography) => {
let algorithm = geography.algorithm.map(|a| a.try_as_edges()).transpose()?;
let md = parquet_geospatial::WkbMetadata::new(geography.crs.as_deref(), algorithm);
...
}could instead be
LogicalType::Geography(geography) => {
// if algorithm is None, then set to the default (EdgeInterpolationAlgorithm::SPERICAL).
let algorithm = geography.algorithm.or(Some(Default::default()));
let algorithm = algorithm.map(|a| a.try_as_edges()).transpose()?;
let md = parquet_geospatial::WkbMetadata::new(geography.crs.as_deref(), algorithm);
...
}Or we could add an impl for GeographyType with an algorithm() function that does the same.
Would removing this cause a lot of thrash downstream?
There was a problem hiding this comment.
You can do whatever is best for you here...the extension type translation and me ( https://github.com/apache/sedona-db/blob/c2613699056b67c001a7ef6899e8568f91bc6596/rust/sedona-geoparquet/src/metadata.rs#L657-L674 ) are probably the only two actual users of this. That said, it is probably less confusing to always apply the Spherical default (most people consuming this API are non-spatial and have no idea what any of this means).
A heads up that there's a bug in that extension type (I will fix but haven't gotten there yet): #9929
There was a problem hiding this comment.
Awesome, thanks for the quick response! I'll go ahead and remove this then and add the algorithm() function I mentioned. I'll also update the API docs for the struct to make clear that if the field is None, SPHERICAL should be used.
A heads up that there's a bug in that extension type (I will fix but haven't gotten there yet): #9929
👍 Thanks for the heads up!
Which issue does this PR close?
LogicalTypeenum to macro generated version #9995.Rationale for this change
See issue. Improve code maintainability by using thrift macro to generate
LogicalTypeserialization code.What changes are included in this PR?
Adds a new macro to generate code for a Thrift
unionthat needs to be forward compatible. Does this by adding a catchall_Unknownvariant for unknown field ids.Are there any user-facing changes?
Yes this is a breaking API change because the
LogicalTypeenum will now use tuple variants rather than struct. This also makes public some structs that were previously private.